@Gadflyii
Contributor

@Gadflyii Gadflyii commented Sep 28, 2025

This change adds a new toggle, "--no-host", that allows the extra buffer types (bufts) to remain functional when a GPU is present, enabling AMX operations in CPU/GPU hybrid inference. If the "--no-host" toggle is not present, current behavior is maintained.

  • The toggle is functional in llama-bench, llama-cli, and llama-server.
  • Compatible with --cpu-moe, --n-cpu-moe N, --cpu-moe-draft, and --n-cpu-moe-draft N as implemented on Sep 27th, 2025.
  • Compatible with all Sapphire Rapids, Emerald Rapids, and Granite Rapids CPUs.
  • If "--no-host" is accidentally enabled on non-Intel CPUs, or on Intel CPUs without AMX, there is no change in behavior (tested with AMD 9950X3D and Intel 14900K).
  • Works in WSL or native Linux (Tested Ubuntu 24.04 LTS and Windows 11 + Ubuntu WSL).

This change allows significant performance increases on the CPU-offloaded layers / MoE in hybrid operation, especially in prompt eval, where performance uplifts of 100%-150%+ are common:

Examples:

Base command:

numactl -N 2,3 -m 2,3 ~/src/llama.cpp/build/bin/llama-cli -m /mnt/ssd2/AI/Qwen3_30B/Q4_0/Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf  -t 64 -b 1024 -c 1024 -n 1024 --numa numactl -p "The quick brown fox jumps over the lazy dog many times. A curious cat watches carefully from the garden wall nearby. Birds sing softly in the morning air, while the sun rises gently above the hills. Children walk slowly to school carrying bright backpacks filled with books, pencils, and small notes. The teacher greets them warmly at the classroom door. Lessons begin with stories about science, history, art, and music. Ideas flow clearly and simply, creating a calm rhythm of learning. Friends share smiles, trade sandwiches, and laugh during the short break. The day continues peacefully until the afternoon bell finally rings." -no-cnv --n-gpu-layers 10

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =      91.57 ms /   927 runs   (    0.10 ms per token, 10123.18 tokens per second)
llama_perf_context_print:        load time =    1202.46 ms
llama_perf_context_print: prompt eval time =    1020.54 ms /   122 tokens (    8.37 ms per token,   119.54 tokens per second)
llama_perf_context_print:        eval time =   22999.39 ms /   804 runs   (   28.61 ms per token,    34.96 tokens per second)
llama_perf_context_print:       total time =   24432.43 ms /   926 tokens
llama_perf_context_print:    graphs reused =        800
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26593 + ( 3915 =  3351 +      20 +     544) +        1578 |
llama_memory_breakdown_print: |   - Host               |                  13299 = 13217 +      76 +       6                |

With "--no-host":

llama_perf_sampler_print:    sampling time =     100.54 ms /  1024 runs   (    0.10 ms per token, 10185.00 tokens per second)
llama_perf_context_print:        load time =    9185.60 ms
llama_perf_context_print: prompt eval time =     478.09 ms /   122 tokens (    3.92 ms per token,   255.18 tokens per second)
llama_perf_context_print:        eval time =   22453.23 ms /   901 runs   (   24.92 ms per token,    40.13 tokens per second)
llama_perf_context_print:       total time =   23289.67 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26885 + ( 3571 =  3351 +      20 +     200) +        1629 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  11664 = 11664 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

Metric | No AMX | --no-host | Change (tps) | Change (%)
Prompt Evaluation | 119.54 tps | 255.18 tps | +135.64 | +113.47%
Token Evaluation | 34.96 tps | 40.13 tps | +5.17 | +14.79%
Overall Inference | 37.90 tps | 43.93 tps | +6.02 | +15.90%
Sampling | 10123.18 tps | 10185.00 tps | +61.82 | +0.61%

With "--cpu-moe":

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =     102.79 ms /  1024 runs   (    0.10 ms per token,  9961.96 tokens per second)
llama_perf_context_print:        load time =     615.24 ms
llama_perf_context_print: prompt eval time =    1198.63 ms /   122 tokens (    9.82 ms per token,   101.78 tokens per second)
llama_perf_context_print:        eval time =   27418.81 ms /   901 runs   (   30.43 ms per token,    32.86 tokens per second)
llama_perf_context_print:       total time =   29076.75 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 29776 + (  675 =   111 +      20 +     544) +        1634 |
llama_memory_breakdown_print: |   - Host               |                  16651 = 16569 +      76 +       6                |

With "--no-host":

llama_perf_sampler_print:    sampling time =     100.07 ms /  1024 runs   (    0.10 ms per token, 10232.84 tokens per second)
llama_perf_context_print:        load time =   10873.17 ms
llama_perf_context_print: prompt eval time =     530.19 ms /   122 tokens (    4.35 ms per token,   230.11 tokens per second)
llama_perf_context_print:        eval time =   23928.42 ms /   901 runs   (   26.56 ms per token,    37.65 tokens per second)
llama_perf_context_print:       total time =   24809.11 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 30115 + (  331 =   111 +      20 +     200) +        1640 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  14904 = 14904 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

Metric | No AMX | --no-host | Change (tps) | Change (%)
Prompt Evaluation | 101.78 tps | 230.11 tps | +128.33 | +126.06%
Token Evaluation | 32.86 tps | 37.65 tps | +4.79 | +14.58%
Overall Inference | 35.18 tps | 41.24 tps | +6.06 | +17.23%
Sampling | 9961.96 tps | 10232.84 tps | +270.88 | +2.72%

@Gadflyii
Contributor Author

Let me know if you have any questions.

@slaren
Member

slaren commented Sep 28, 2025

This should already be possible with the more generic command line option -nr, --no-repack.

Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

@Gadflyii
Contributor Author

> This should already be possible with the more generic command line option -nr, --no-repack.
>
> Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

I have played with that a little, but I couldn't get it to work as expected. I think this is due to how the extra bufts were implemented as part of the original AMX PR. Not all the CPU weights go into the CPU_REPACK / AMX bufts, so I think we need to maintain the CPU_Mapped model buffer plus the extra bufts CPU_REPACK and AMX?

Is that what you meant?

@slaren
Member

slaren commented Sep 28, 2025

What I mean is adding an option to skip adding the host buffer types here:

    // add a host buffer type
    // storing the tensors in a host buffer is useful when the processing of large batches
    // is offloaded to a GPU device, since it reduces the time spent on data transfers
    // generally, this will be done using the first device in the list
    // a better approach would be to handle this on a weight-by-weight basis using the offload_op
    // function of the device to determine if it would benefit from being stored in a host buffer
    for (auto * dev : devices) {
        ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
        if (buft) {
            buft_list.emplace_back(dev, buft);
            break;
        }
    }

The extra buffer types don't get used when there is a GPU because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.
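
To make the priority argument concrete, here is a toy model of a priority-ordered buffer type list. It is only a sketch under stated assumptions: `ToyBuft`, `make_toy_buft_list`, `pick_buft`, and the `no_host` parameter are hypothetical names for illustration and do not reproduce the actual llama.cpp implementation. It assumes, as described in this thread, that the first matching entry in the list wins and that extra (repack) buffers accept only some weights.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a buffer type (hypothetical, not a ggml type).
struct ToyBuft {
    std::string name;
    bool supports_all_tensors; // assumption: extra buffers accept only repackable weights
};

// Build a priority-ordered list, highest priority first.
// The no_host parameter is illustrative: it skips the host buffer entry.
std::vector<ToyBuft> make_toy_buft_list(bool has_gpu, bool no_host) {
    std::vector<ToyBuft> list;
    if (has_gpu && !no_host) {
        list.push_back({"host", true});      // host buffer wins whenever it is present
    }
    list.push_back({"extra_repack", false}); // stands in for CPU_REPACK / AMX
    list.push_back({"cpu", true});           // plain CPU buffer as the final fallback
    return list;
}

// Return the name of the first buffer type that can hold the tensor.
std::string pick_buft(const std::vector<ToyBuft> & list, bool tensor_is_repackable) {
    for (const auto & b : list) {
        if (b.supports_all_tensors || tensor_is_repackable) {
            return b.name;
        }
    }
    return "none";
}
```

In this model, a repackable weight lands in `host` when a GPU is present and the host entry is kept, and in `extra_repack` once the host entry is skipped. Weights that the extra buffer cannot hold still fall through to `cpu`.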

@Gadflyii
Contributor Author

> What I mean is adding an option to skip adding the host buffer types here:
>
>     // add a host buffer type
>     // storing the tensors in a host buffer is useful when the processing of large batches
>     // is offloaded to a GPU device, since it reduces the time spent on data transfers
>     // generally, this will be done using the first device in the list
>     // a better approach would be to handle this on a weight-by-weight basis using the offload_op
>     // function of the device to determine if it would benefit from being stored in a host buffer
>     for (auto * dev : devices) {
>         ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
>         if (buft) {
>             buft_list.emplace_back(dev, buft);
>             break;
>         }
>     }
>
> The extra buffer types don't get used when there is a GPU because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.

I will make the change and update the PR

@Gadflyii
Contributor Author

@slaren any feedback on what the "opt-in" switch should be called? I can keep it "--amx" or I can make it more generic, something like "--xbuffers" in case there are any other extra buffers added in the future?

@slaren
Member

slaren commented Sep 29, 2025

I am not sure what would be the opt-in switch. What I am proposing is a flag to disable host buffer types, and it should be called something like --no-host.

@Gadflyii
Contributor Author

@slaren all changes have been made.

All feedback welcomed, and thank you for all your help.

@Gadflyii
Contributor Author

Gadflyii commented Oct 6, 2025

@slaren Just wanted to circle back to this. Is there anything you would like me to change, or do you have any questions?

@slaren
Member

slaren commented Oct 6, 2025

Looks good, please add a note about the change to llama_model_params in #9289 when you have a chance after this is merged.

@Gadflyii
Contributor Author

Gadflyii commented Oct 6, 2025

> Looks good, please add a note about the change to llama_model_params in #9289 when you have a chance after this is merged.

Will do, and again, thank you for your help, it is really appreciated.

@slaren slaren merged commit 3df2244 into ggml-org:master Oct 6, 2025
60 checks passed
@DocShotgun
Contributor

DocShotgun commented Oct 7, 2025

Would there be a performance benefit to keeping the ability to allocate a host buffer, but simply allowing the repack buffer to take priority for the relevant tensors?

EDIT: --no-host seems to hurt prompt processing performance on my Xeon AMX-enabled CPU when doing CPU+GPU inference. Does --no-host disable the ability to offload all prompt processing to the GPU when ingesting large batches?

@slaren
Member

slaren commented Oct 7, 2025

Repacking disables GPU offloading, which is why it normally isn't done when there is a GPU.

yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren